Abstract

We present the first feasible method for sampling a dynamic data stream with deletions, where the
sample consists of pairs $(k, C_k)$ of a value $k$ and its exact total count $C_k$. Our algorithms are for
both Strict Turnstile data streams and the most general Non-strict Turnstile data streams, where
each element may have a negative total count. Our method improves by an order of magnitude
the known processing time of each element in the stream, which is crucial for data
stream applications. For example, for a sample of size $O(\epsilon^{-2} \log(1/\delta))$ in Non-strict streams, our
solution requires $O((\log\log(1/\epsilon))^2 + (\log\log(1/\delta))^2)$ operations per stream element, whereas the
best previous solution requires $O(\epsilon^{-2} \log^2(1/\delta))$ evaluations of a fully independent hash function
per element. Here $1 - \delta$ is the success probability and $\epsilon$ is the additive approximation error.

We achieve this improvement by constructing a single data structure from which multiple
elements can be extracted with very high success probability. The sample we generate is useful for
calculating both forward and inverse distribution statistics, within an additive error, with
provable guarantees on the success probability. Furthermore, our algorithms can run on distributed
systems and extract statistics on the union or difference between data streams. They can be used
to calculate the Jaccard similarity coefficient as well.

1 Introduction

Sampling is a fundamental component in data stream algorithms [19], as a method to keep a
synopsis of the data in memory sublinear to the size of the stream. The sample can be used 
to calculate stream statistics of interest such as the frequent items in the stream (also called 
heavy hitters) or the quantiles, when the stream itself cannot be fully kept in memory. 

In the most general data stream model, the data is a series of elements $(x_i, c_i)$ where $x_i$ is
an element's value and $c_i$ is a count. A certain value may appear in the stream multiple times
with various counts. Summing all counts of a value $k$ in the stream gives a pair $(k, C_k)$ where
$C_k = \sum_{i : x_i = k} c_i$ is the total count of $k$. A value $k$ with $C_k = 0$ is a deleted element, and its
effect on stream statistics should be as if it had never appeared in the stream. In particular,
it must not appear in a sample of the stream obtained by any sampling algorithm.

We denote the function that maps the values to their frequencies by $f$, i.e. $f(k) = C_k$.
An exact sampling algorithm outputs a sample of the pairs $(k, f(k))$ composed of values and
their exact total count. An $\epsilon$-approximate sampling algorithm for $\epsilon \in (0, 1)$ outputs a sample
of the pairs $(k, f'(k))$, where $f'(k) \in [(1 - \epsilon) f(k), (1 + \epsilon) f(k)]$.

An $\epsilon$-approximate sample is sufficient for answering forward distribution queries, which
concern properties of $f$, such as what is $f(k)$. However, it cannot be used for queries on the
inverse distribution function defined as $f^{-1}(C) = \frac{|\{k : C_k = C\}|}{|\{k : C_k \neq 0\}|}$ for $C \neq 0$, i.e. the fraction
of distinct values with a total count equal to $C$. The reason is that an $\epsilon$-approximation of $f$
can result in a significant change to $f^{-1}$. For example, if an $\alpha \in (0, 1)$ fraction of the distinct
values have a total count $C$, and in the $\epsilon$-approximate sample all of them have a total count
$(1 + \epsilon)C$, one might get $f^{-1}(C) = 0$ instead of $\alpha$. Thus, an exact sample is required in order
to approximate the inverse distribution.
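
As a small illustration of this point, the following sketch computes the forward and inverse distributions of a toy stream exactly, and then shows how scaling every count by $(1 + \epsilon)$ drives the inverse distribution at $C = 2$ to zero. The function names and the toy stream are illustrative assumptions, not part of the paper.

```python
from collections import Counter
from fractions import Fraction

def total_counts(stream):
    """Forward distribution f: sum the counts per value, dropping deleted values (total 0)."""
    totals = Counter()
    for value, count in stream:
        totals[value] += count
    return {k: c for k, c in totals.items() if c != 0}

def inverse_distribution(f, target):
    """Fraction of distinct non-deleted values whose total count equals target."""
    if not f:
        return Fraction(0)
    return Fraction(sum(1 for c in f.values() if c == target), len(f))

# Toy stream of (value, count) pairs; value 7 is inserted and then fully deleted.
stream = [(7, 2), (3, 1), (7, -2), (5, 2), (3, 1), (9, 4)]
f = total_counts(stream)

# An epsilon-approximate view in which every count is inflated by (1 + eps).
eps = Fraction(1, 10)
f_approx = {k: c * (1 + eps) for k, c in f.items()}

exact_answer = inverse_distribution(f, 2)          # two of three values have count 2
approx_answer = inverse_distribution(f_approx, 2)  # no value has count exactly 2 any more
```

Exact rational arithmetic is used so that the comparison with the target count is not affected by floating-point rounding.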

The algorithms we present in this paper are exact sampling algorithms for the most 
general case of streams with deletions. Thus, they are useful for calculating both forward and 
inverse distribution statistics. We describe applications which require exact sampling next. 

Data stream algorithms are often used in network traffic analysis. They are placed in 
DSMSs - Data Stream Management Systems. Applications of dynamic exact sampling in 
these environments include detecting malicious IP traffic in the network. Karamcheti et al.
[16] showed that inverse distribution statistics can be used for earlier detection of content
similarity, an indicator of malicious traffic. Inverse distribution queries can also be used for 
detecting denial-of-service (DoS) attacks, specifically SYN floods, which are characterized as 
flows with a single packet (often called rare flows). 

Exact dynamic sampling is beneficial in geometric data streams as well [HI E]. In these
streams, the items represent points from a discrete $d$-dimensional geometric space $\{1, \ldots, \Delta\}^d$.
Our algorithms can also run on data streams of this type. 

Previous Work 

Most previous work that has been done on sampling of dynamic data streams that support 
deletions was limited to approximating the forward distribution [2J (TUJ HB] . Works on the 
inverse distribution include a restricted model with total counts of 0/1 for each value [H] 121] , 
and minwise-hashing, which samples uniformly the set of items but does not support deletions 
[5\. The work [11] supports only a few deletions. 

Inverse distribution queries in streams with multiple deletions were supported in a work 
by Frahling et al. [5] [7] , who developed a solution for Strict Turnstile data streams and used 
it in geometric applications. Cormode et al. [1] developed a solution for both the Strict 
Turnstile and the Non-strict Turnstile streams. However, they did not analyze the required 
randomness or the algorithm's error probability in the Non-strict model. Jowhari et al. [14]
studied $L_p$ samplers [18] and built an $L_0$ sampler for Non-strict Turnstile streams.

Our Results 

Previous works [H O El H] constructed data structures for sampling only a single element. 
In order to use their structures for applications that require a sample of size K, one has 
to use O(K) independent instances of their structure. The obtained sample holds elements 
chosen independently, and it might contain duplicates. 

Running the sampling procedure $O(K)$ times in parallel, inserting each stream element
into $O(K)$ instances of the data structure, results in an enormous processing
time. Typical stream queries require a sample of size $K = \Omega(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$ where the results are
$\epsilon$-approximated, and $1 - \delta$ is the success probability of the process. Thus, the number of
operations required to process each element in the stream is multiplied by many orders of
magnitude. For typical values such as $\epsilon = 10^{-2}$, $\delta = 10^{-6}$ the number of operations for
each element in the stream is multiplied by about 200,000. The structures of jU EJ |7J [H]
cannot be used for obtaining a $K$ size sample due to this infeasible processing load. We present
algorithms that can accomplish the task.
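
The factor of about 200,000 can be checked with a short calculation. The sketch below assumes the hidden constant in the sample-size bound is 1 and that the logarithm is base 2; both are assumptions made here for illustration (with the natural logarithm the same formula gives roughly 138,000).

```python
import math

def sample_size(eps, delta):
    """K = (1 / eps^2) * log2(1 / delta), with the hidden constant taken as 1 (assumption)."""
    return (1.0 / eps ** 2) * math.log2(1.0 / delta)

# Typical values quoted in the text: eps = 10^-2, delta = 10^-6.
k = sample_size(1e-2, 1e-6)
print(f"K is about {k:,.0f} single-element structures per stream element")
```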
Our contributions are as follows.

1. We construct a data structure that can extract a sample of size $K$, whereas previous works
returned only a single sampled element. Using a single data structure significantly reduces
the processing time and the required randomness. Thus, our algorithms are feasible for
data stream applications, in which fast processing is crucial. This optimization enables
applications that were previously limited to gross approximations of order $\epsilon = 10^{-2}$ to
obtain a much more accurate approximation of order $\epsilon = 10^{-6}$ in feasible time.

2. We present solutions for both Strict Turnstile data streams and the most general
Non-strict Turnstile data streams. We are the first to provide algorithms with proved success
probability in the Non-strict Turnstile model. For this model we develop a structure
called the Non-strict Bin Sketch.

3. We provide more efficient algorithms in terms of the randomness required. Our algorithms
do not require fully independent or min-wise independent hash functions or PRGs.

4. We introduce the use of $\Theta(\log \frac{1}{\delta})$-wise independent hash functions to generate a sample
with $1 - \delta$ success probability for any $\delta > 0$. Our method outperforms the traditional
approach of increasing the success probability from a constant to $1 - \delta$ by $\Theta(\log \frac{1}{\delta})$
repetitions. We utilize a method of fast evaluation of hash functions which reduces our
processing time to $O((\log\log \frac{1}{\delta})^2)$, while the traditional approach requires $O(\log \frac{1}{\delta})$ time.

A comparison of our algorithms to previous work is presented in Table 1. We introduce
two algorithms, denoted FRS (Full Recovery Structure) and $\epsilon$-FRS according to the recovery
structure used. The performance of our algorithms is summarized in the table as well as in
Theorems 6 and 10. Our algorithms improve the update and the sample extraction times.



Table 1 Comparison of our sampling algorithms for streams with deletions to previous work.
Notes: FRS and $\epsilon$-FRS support sample sizes $K = \Omega(\log \frac{1}{\delta})$ and $K = \Omega(\frac{1}{\epsilon} \log \frac{1}{\delta})$ respectively. Memory
size is given in bits. Update time is the number of operations per element. Sample extraction time
is for a $K$ size sample. S and NS denote the Strict and Non-strict models respectively. In [3], S is the
deterministic approach, NS is the probabilistic approach, and the hash function is fully independent.
In [14], the extraction time is under the assumption of sparse recovery in linear time.



2 Preliminaries 
2.1 Data Stream Model 

Our input is a stream in the Turnstile data stream model [19]. The Turnstile data stream
consists of $N$ pairs $(x_i, c_i)$ where $x_i$ is the element's value, and $c_i$ is its count. The elements
$x_i$ are taken from a fixed universe $U = [m]$ (where $[m] = \{0, \ldots, m - 1\}$). The counts $c_i$ are
taken from the range $[-r, r]$. Let $t$ be the time we process the $t$'th pair in the data stream.
We define the total count of value $k$ at time $t$ to be $C_k(t) = \sum_{i \le t : x_i = k} c_i$. We assume that
$\forall t, k$, $C_k(t) \in [-r, r]$. In the Strict Turnstile model, $C_k(t) \ge 0$ at all times $t$. In the Non-strict
Turnstile model, $C_k(t)$ may obtain negative values.

A sample $S$ drawn at time $t$ is a subset of $\{(k, C_k(t)) : C_k(t) \neq 0\}$. Note that $C_k(t)$ is the
exact total count at time $t$. To simplify the notation we consider sampling only at the end of
the process, and denote $C_k = C_k(N)$.

2.2 Problem Definition 

Given a data stream of $N$ pairs $(x_i, c_i)$, which is either a Strict or Non-strict Turnstile data
stream, assume there is an application that needs to perform some queries on the stream
such as calculating the inverse distribution. This application allows an $\epsilon \in (0, 1)$ additive
approximation to the answers, and a $1 - \delta$ success probability of the process, for $\delta \in (0, 1)$.
The application predefines the size of the sample it requires to be $K$, where $K$ might depend
on $\epsilon$ and $\delta$. The input to our sampling algorithm is the data stream, $K$ and $\delta$.

Let $D = \{(k, C_k) : C_k \neq 0\}$ be the set of all pairs of values and their total counts in the
stream at the time we sample. The output of our sampling algorithm is a sample $S \subseteq D$ of
size $|S| = \Theta(K)$, generated with probability $1 - \delta$. Note that the size of the sample returned
is of order $K$ and not precisely $K$.

Applications typically require a sample of size $K = \Omega(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$ for an $\epsilon$ approximation
with $1 - \delta$ success probability. However, our algorithms FRS and $\epsilon$-FRS support even smaller
sample sizes, $\Omega(\log \frac{1}{\delta})$ and $\Omega(\frac{1}{\epsilon} \log \frac{1}{\delta})$ respectively.

We define the following two "flavors" of samples. 

► Definition 1. A $t$-wise independent sample is a random set $S \subseteq D$ in which each subset of
$t$ distinct elements in $D$ has equal probability to be in $S$.

► Definition 2. Let $X \subseteq D$ be a sample obtained by a $t$-wise independent sampling algorithm,
and let $\epsilon \in (0, 1)$. A subset $S \subseteq X$ of size $(1 - \epsilon)|X| \le |S| \le |X|$ is a $(t, \epsilon)$-partial sample.

Our FRS algorithm returns a $t$-wise independent sample, where $t = \Omega(\log \frac{1}{\delta})$. Our $\epsilon$-FRS
algorithm returns a $(t, \epsilon)$-partial sample. This means there is a fractional bias of at most $\epsilon$ in
the sample returned by $\epsilon$-FRS.

The key insight is that a $t$-wise independent sample for $t = \Omega(\log \frac{1}{\delta})$ guarantees the same
approximation as an independent sample. For example, a sample of size $K = \Omega(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$
enables the approximation of the inverse distribution queries and the Jaccard similarity
coefficient up to an additive error of $\epsilon$ with $1 - \delta$ success probability. The $(t, \epsilon)$-partial sample
can be used for the same stream statistics because it only adds an error of at most $\epsilon$ to the
approximation. This is demonstrated in Sect. 5.

2.3 Hash Function Techniques 

Throughout the paper we make extensive use of $t$-wise independent hash functions for
$t = \Theta(\log \frac{1}{\delta})$ and $t = \Theta(\log \frac{K}{\delta})$. We use the following techniques in our analysis. For
bounding the error probability we use the Moment Inequality for high moments and the
estimation of [TJ] (see Appendix A). For hash evaluations we use the multipoint evaluation
algorithm of a polynomial of degree less than $t$ on $t$ points in $O(t \log^2 t)$ operations [25].
Thus, evaluation of a $t$-wise independent hash function takes $O(\log^2 t)$ amortized time per
element by batching $t$ elements together. This is the time we use in our analysis whenever
we evaluate these hash functions.
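
For concreteness, a $t$-wise independent hash function can be realized as a random polynomial of degree less than $t$ over a prime field. The sketch below is a minimal illustration under assumptions made here: the field size `PRIME`, the class name `TWiseHash`, and the final reduction modulo the output range (which is only approximately uniform) are not taken from the paper. It evaluates each point with Horner's rule in $O(t)$ time; it does not implement the fast multipoint evaluation the text relies on for the $O(\log^2 t)$ amortized bound.

```python
import random

PRIME = (1 << 61) - 1  # assumed field size; values hashed must be smaller than this

class TWiseHash:
    """Random polynomial of degree < t over GF(PRIME), reduced to [out_range]."""

    def __init__(self, t, out_range, rng):
        self.coeffs = [rng.randrange(PRIME) for _ in range(t)]  # a_0, ..., a_{t-1}
        self.out_range = out_range

    def __call__(self, x):
        acc = 0
        for a in reversed(self.coeffs):  # Horner's rule, highest coefficient first
            acc = (acc * x + a) % PRIME
        return acc % self.out_range

    def batch(self, xs):
        """Naive batch evaluation; a fast multipoint algorithm would replace this loop."""
        return [self(x) for x in xs]

h = TWiseHash(t=4, out_range=100, rng=random.Random(0))
values = h.batch(range(50))
```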
3 Algorithm Overview

In this section we provide an overall view of our sampling algorithm (see Fig. 1).

Figure 1 Pseudo codes of data structure update and extracting a $K$ size sample.



In the first phase of our sampling algorithms the elements in the data stream are mapped 
to levels, in a similar way to the method in @J. Each element is mapped to a single level. 
The number of elements mapped to each level decreases exponentially. Thus, we can draw a 
sample of the required size regardless of the number of distinct elements in the stream. 

In each level we store a recovery structure for $K$ elements, which is the core of our
algorithm. We present two alternative structures, FRS and $\epsilon$-FRS, with trade-offs between
space, update time and sample extraction time. Each structure consists of several arrays with
a Bin Sketch (BS) in each cell of the arrays. Assume a structure (FRS or $\epsilon$-FRS) contains
the set of elements $X$, where $K \le |X| \le \tilde{K}$ and $\tilde{K} = \Theta(K)$. FRS allows extracting all the
elements it contains, returning a $t$-wise independent sample, where $t = \Theta(\log \frac{1}{\delta})$. $\epsilon$-FRS
allows extracting at least $(1 - \epsilon)|X|$ of the $|X|$ elements it contains, returning a $(t, \epsilon)$-partial
sample. The problem of recovering the $|X|$ elements is similar to the sparse recovery problem
[121 \2'6\ , however in our case there is no tail noise, and we limit the amount of randomness.

The sample $S$ is the set of elements extracted from FRS (or $\epsilon$-FRS) in a single level. In
order to select the level we use a separate data structure that returns $\tilde{L}_0$, an estimation of
the number of distinct elements in the stream. This data structure is updated as each stream
element arrives, in parallel to the process described above.

Extracting a sample is performed as follows. First we query the data structure of $\tilde{L}_0$.
Then we select a level $l^*$ that should have $\Theta(K)$ elements with probability $1 - \delta$. We recover
the $|X|$ elements from that level, or at least $(1 - \epsilon)|X|$ of them, depending on the recovery
structure, with probability $1 - \delta$. The elements recovered are the final output sample $S$.

4 Sampling Algorithms 

4.1 Bin Sketch - Data Structure Building Block 

In this section we present the Bin Sketch (BS), a tool used as a building block in our data
structure. Given a data stream of elements $\{(x_i, c_i)\}_{i \in [N]}$, the input to BS is a substream
$\{(x_i, c_i)\}_{i \in I}$ where $I \subseteq [N]$. BS maintains a sketch of the substream. Its role is to identify
if the substream contains a single value, and if so, to retrieve the value and its total count.
Note that a single value can be obtained from a long stream of multiple elements if all values
but one have zero total count at the end.

Strict Bin Sketch ($BS_s$)

We describe the Bin Sketch for the Strict Turnstile model, previously used in [7] [5]. Given a
stream of pairs $\{(x_i, c_i)\}_{i \in I}$, the Strict Bin Sketch ($BS_s$) consists of three counters $X$, $Y$
and $Z$: $X = \sum_{i \in I} c_i$, $Y = \sum_{i \in I} c_i x_i$, $Z = \sum_{i \in I} c_i x_i^2$.

$BS_s$ properties are summarized as follows: The space usage of $BS_s$ is $O(\log(mr))$ bits.
If $BS_s$ holds a single element, then we can detect and recover it. $BS_s$ holds a single element
$\iff X \neq 0$, $Y \neq 0$, $Z \neq 0$ and $XZ = Y^2$. The recovered element is $(k, C_k)$, where $k = Y/X$
and $C_k = X$. $BS_s$ is empty $\iff X = 0$. For proof see [5].
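
The three counters and the single-element test translate directly into code. The following is a minimal sketch, assuming element values are positive integers (the test $Y \neq 0$ above would reject the value 0) and that the stream is Strict Turnstile; the class and method names are chosen here for illustration.

```python
class StrictBinSketch:
    """Counters X = sum c_i, Y = sum c_i * x_i, Z = sum c_i * x_i^2 over a substream."""

    def __init__(self):
        self.X = 0
        self.Y = 0
        self.Z = 0

    def update(self, value, count):
        # Insertions use a positive count, deletions a negative one.
        self.X += count
        self.Y += count * value
        self.Z += count * value * value

    def is_empty(self):
        # Valid in the Strict model, where all total counts are non-negative.
        return self.X == 0

    def single_element(self):
        """Return (k, C_k) if the sketch holds exactly one value, else None."""
        if self.X != 0 and self.Y != 0 and self.Z != 0 and self.X * self.Z == self.Y * self.Y:
            return self.Y // self.X, self.X
        return None

bs = StrictBinSketch()
for value, count in [(5, 3), (9, 2), (9, -2)]:  # value 9 is inserted and then deleted
    bs.update(value, count)
recovered = bs.single_element()
```

Python integers are unbounded, so the products $XZ$ and $Y^2$ are compared exactly; a fixed-width implementation would need counters wide enough for these products.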

Non-strict Bin Sketch ($BS_{ns}$)

We provide a generalization to Bin Sketch and adjust it to the Non-strict Turnstile model.
If $BS_s$ contains two elements, then $XZ \neq Y^2$. Thus, there is a distinction between $BS_s$
with two elements and $BS_s$ with a single element. However, we cannot always distinguish
between three or more elements and a single element.

A previous attempt to solve the problem used a deterministic collision detection structure
[3]. This structure held the binary representation of the elements. However, this representation
falsely identifies multiple elements as a single element on some inputs.¹

In order to solve the problem we use a new counter defined as follows.

► Lemma 3. Let $\mathcal{H} = \{h : [m] \to [q]\}$ be a $t$-wise independent family of hash functions, and
let $h$ be a hash function drawn randomly from $\mathcal{H}$. Let $T = \sum_k C_k h(k)$ be a counter. Assume
$T$ is a sum of at most $t - 1$ elements. Then for every element $(k', C'_{k'})$, where $k' \in [m]$, if $T$
is not the sum of the single element $(k', C'_{k'})$, then $\Pr_{h \in \mathcal{H}}[T = C'_{k'} h(k')] \le 1/q$.

Proof. We subtract $C'_{k'} h(k')$ from $T$ and obtain $T' = T - C'_{k'} h(k')$. If $T$ is not the sum of the
single element $(k', C'_{k'})$, there are between 1 and $t$ elements in $T'$. The hashes of those elements
are independent because $h$ is $t$-wise independent. We therefore have a linear combination of
independent uniformly distributed elements in the range $[q]$, thus $\Pr_{h \in \mathcal{H}}[T' = 0] \le 1/q$. ◀

The Non-strict Bin Sketch ($BS_{ns}$) consists of four counters: $X$, $Y$, $Z$ and an additional
counter $T$, as defined in Lemma 3. The space of $BS_{ns}$ is $O(\log(mrq))$ bits. The time to
insert an element is $O(\log^2 t)$ since we evaluate a $t$-wise independent hash function. There
are other ways to maintain a sketch such as keeping a fingerprint. However, a fingerprint
update takes $O(\log m)$ time, which depends on the size of the universe, while the update
time of $BS_{ns}$ depends on $t$, which is later set to $t = \Theta(\log \frac{1}{\delta})$.

► Corollary 4. Three or more elements in $BS_{ns}$ may be falsely identified as a single element.
The error probability is bounded by $1/q$, when there are at most $t - 1$ elements in the bin.
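
The role of the extra counter $T$ can be seen on a concrete three-element collision. In the sketch below, the pairs $(11, 3)$, $(12, -3)$, $(13, 1)$ produce $X = 1$, $Y = 10$, $Z = 100$, so the counters $X, Y, Z$ alone look exactly like the single element $(10, 1)$; the check on $T$ rejects it except with probability about $1/q$. The prime `Q`, the arithmetic of $T$ modulo `Q`, and all names are assumptions made for this illustration.

```python
import random

Q = (1 << 31) - 1  # assumed prime hash range q; element values must be smaller than Q

def make_hash(t, rng):
    """Random polynomial of degree < t over GF(Q): a t-wise independent hash into [Q]."""
    coeffs = [rng.randrange(Q) for _ in range(t)]
    def h(x):
        acc = 0
        for a in reversed(coeffs):  # Horner's rule
            acc = (acc * x + a) % Q
        return acc
    return h

class NonStrictBinSketch:
    def __init__(self, h):
        self.h = h
        self.X = self.Y = self.Z = self.T = 0

    def update(self, value, count):
        self.X += count
        self.Y += count * value
        self.Z += count * value * value
        self.T = (self.T + count * self.h(value)) % Q  # T = sum of C_k * h(k), kept mod Q

    def single_element(self):
        X, Y, Z = self.X, self.Y, self.Z
        if X == 0 or Y == 0 or Z == 0 or X * Z != Y * Y or Y % X != 0:
            return None
        k = Y // X
        if (X * self.h(k)) % Q != self.T:  # the additional test from Lemma 3
            return None
        return k, X

sketch = NonStrictBinSketch(make_hash(8, random.Random(2024)))
for value, count in [(11, 3), (12, -3), (13, 1)]:
    sketch.update(value, count)
looks_single = sketch.X * sketch.Z == sketch.Y ** 2  # X, Y, Z mimic the element (10, 1)
verdict = sketch.single_element()                    # rejected by the T counter
```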

BS is placed in each cell, which we refer to as bin, in each of the arrays of our data 
structure. Its role is to identify a collision, which occurs when more than one element is in 
the bin. If no collision occurred, BS returns the single element in the bin. BS supports 
additions and deletions and is oblivious of the order of their occurrences in the stream, which 
makes it a strongly history independent data structure [T7] [50]. Its use in our sampling data 
structure makes the whole sampling structure strongly history independent. 



¹ For example, for every set of 4 pairs $\{(2k, 1), (2k + 1, -1), (2k + 2, -1), (2k + 3, 1)\}$, the whole structure
is zeroed and any additional pair $(k', C_{k'})$ will be identified as a single element.

4.2 $L_0$ Estimation

The number of distinct values in a stream with deletions is known as the Hamming norm
$L_0 = |\{k \in [m] : C_k \neq 0\}|$ [3]. $L_0$ estimation is used by our algorithm for choosing the level
$l^*$ from which we extract the elements that are our sample. We use the structure of Kane et
al. [15], which provides an $\epsilon$-approximation $\tilde{L}_0$ with 2/3 success probability for both Strict
and Non-strict streams. It uses space of $O(\frac{1}{\epsilon^2} \log m (\log \frac{1}{\epsilon} + \log\log(r)))$ bits and $O(1)$ update
and report times.

Our algorithm requires only a constant approximation to $L_0$. Hence the space is
$O(\log m \cdot \log\log(r))$ bits. However, we require $1 - \delta$ success probability for any given $\delta > 0$ and not
only 2/3. For estimation of $L_0$, where $L_0 = \Omega(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$, we use methods similar to those in
the rest of the paper. We keep $\tau = \Theta(\log \frac{1}{\delta})$ instances of the Kane et al. structure, and use
one $\Theta(\log \frac{1}{\delta})$-wise independent hash function to map each stream element to one of the instances.
With constant high probability all instances have approximately the same number of elements.
$\tilde{L}_0$ is the median of the estimations obtained from all instances multiplied by $\tau$, and is a
constant approximation to $L_0$ with probability $1 - \delta$.

Thus we obtain an $L_0$ estimation algorithm with $O(\log m \log\log(r) \log \frac{1}{\delta})$ bits of space,
$O((\log\log \frac{1}{\delta})^2)$ update time, since one $\Theta(\log \frac{1}{\delta})$-wise independent hash function is evaluated,
$O(\log \frac{1}{\delta})$ reporting time, and $1 - \delta$ success probability. These requirements are lower than
their corresponding requirements in the other phases of our algorithm.



4.3 Mapping to Levels 

The first phase of our algorithm is mapping the elements in the data stream to levels. We
use a family of $t$-wise independent hash functions $\mathcal{H} = \{h : [m] \to [M]\}$ for $t = \Theta(\log \frac{1}{\delta})$.
We select $h \in \mathcal{H}$ randomly and use it to map the elements to $L = \log_{1/\lambda} M$ levels, for
$\lambda \in (0, 1)$. Typical values are $\lambda = 0.5$ and $M = 2m$. The mapping is performed using a
set of hash functions $h_l(x) = \lfloor \frac{h(x)}{\lambda^l M} \rfloor$, for $l \in [L]$. An element $x$ is mapped to level $l \iff$
$(h_l(x) = 0 \wedge h_{l+1}(x) \neq 0)$. Note that each element is mapped to a single level.
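
The level rule can be illustrated as follows. The sketch assumes the reading $h_l(x) = \lfloor h(x) / (\lambda^l M) \rfloor$, under which level $l$ receives the hash values in $[\lambda^{l+1} M, \lambda^l M)$, i.e. a $\lambda^l (1 - \lambda)$ fraction of the hash range. The function names are illustrative, and the hash value is passed in directly rather than computed from an element.

```python
import math

def h_level(hash_value, level, M, lam=0.5):
    """h_l(x) = floor(h(x) / (lam^l * M)), under the assumed reading of the definition."""
    return math.floor(hash_value / (lam ** level * M))

def level_of(hash_value, M, lam=0.5):
    """Return the level l with h_l = 0 and h_{l+1} != 0, or None if there is none."""
    num_levels = round(math.log(M, 1 / lam))
    for level in range(num_levels):
        if h_level(hash_value, level, M, lam) == 0 and h_level(hash_value, level + 1, M, lam) != 0:
            return level
    return None

# With M = 16 and lam = 0.5 the hash values 1..15 split into levels of sizes 8, 4, 2, 1.
M = 16
sizes = [sum(1 for v in range(M) if level_of(v, M) == l) for l in range(4)]
```

Under this reading the hash value 0 is not assigned to any level; how the paper treats that boundary case is not recoverable from the text.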

Using this mapping, the set of elements mapped to each level is t-wise independent. It 
follows that in order to obtain a $t$-wise independent sample, we can extract all elements from
any level we choose. However, we must select the level independently of the elements that 
were mapped to it. If any event that depends on the specific mapping influences the level 
selection, the sample becomes biased. Biased samples appeared in some previous works. 

In order to choose the level regardless of the mapping, we use the number of distinct
elements $L_0$. We obtain an estimation $\tilde{L}_0$ from the $L_0$ estimation structure, where $L_0 \le \tilde{L}_0 \le
\alpha L_0$ for $\alpha > 1$ with $1 - \delta$ probability, and choose the level where $K$ elements are expected.

► Lemma 5. Let the elements be mapped to levels using a hash function $h$ selected randomly
from a $t$-wise independent hash family $\mathcal{H}$ for $t = \Omega(\log \frac{1}{\delta})$. Assume there is an estimation
$L_0 \le \tilde{L}_0 \le \alpha L_0$ for $\alpha > 1$, and $K = \Omega(\log \frac{1}{\delta})$. Then the level $l^*$ for which
$\frac{1}{\alpha} \tilde{L}_0 \lambda^{l^*+1} (1 - \lambda) < 2K \le \frac{1}{\alpha} \tilde{L}_0 \lambda^{l^*} (1 - \lambda)$ has between $K$ and $(\frac{2\alpha}{\lambda} + 1)K$ elements with probability at least $1 - \delta$.

Proof. See Appendix B. ◀

Let $X$ be the set of elements in level $l^*$, from which we choose to extract the sample. We
denote by $\tilde{K} = (\frac{2\alpha}{\lambda} + 1)K$ the maximal number of elements in level $l^*$. With probability at least
$1 - \delta$, $K \le |X| \le \tilde{K}$. For typical values $\alpha = 1.5$ and $\lambda = 0.5$, $K \le |X| \le 7K$.
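
The level selection depends only on the estimate of the number of distinct elements, never on the mapping itself. A minimal sketch follows, assuming the selection rule reads: pick the largest level $l$ whose scaled expected occupancy $\frac{1}{\alpha} \tilde{L}_0 \lambda^{l} (1 - \lambda)$ is still at least $2K$. The function name and the handling of streams that are too small are assumptions made here.

```python
def select_level(l0_estimate, K, alpha=1.5, lam=0.5):
    """Largest level l with (1/alpha) * l0_estimate * lam^l * (1 - lam) >= 2K, or None."""
    occupancy = l0_estimate * (1 - lam) / alpha  # scaled expected occupancy of level 0
    if occupancy < 2 * K:
        return None  # too few distinct elements for a sample of this size
    level = 0
    while occupancy * lam >= 2 * K:  # the next level would still be large enough
        occupancy *= lam
        level += 1
    return level

# With an estimate of 96,000 distinct elements and K = 100, level 0 has scaled
# occupancy 32,000, which halves at each level until it drops below 2K = 200.
chosen = select_level(96000, 100)
```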
4.4 Full Recovery Data Structure (FRS)

In this section we present the Full Recovery Data Structure (FRS) that is placed in each of
the levels. Since the recovery structure is the core of our sampling algorithm, we refer to the
entire algorithm with FRS in each level as the FRS algorithm. The sampling algorithm is
summarized in the following theorem.

► Theorem 6. Given a required sample size $K = \Omega(\log \frac{1}{\delta})$ and $\delta \in (0, 1)$, the FRS sampling
algorithm generates a $\Theta(\log \frac{1}{\delta})$-wise independent sample $S$ with $1 - \delta$ success probability.
The sample size is $K \le |S| \le \tilde{K}$, where $\tilde{K} = \Theta(K)$. For both Strict and Non-strict
data streams FRS uses $O(K \log \frac{K}{\delta} \log(mr) \log(m))$ bits of space, $O(\log \frac{K}{\delta})$ update time per
element, $O(\log \frac{K}{\delta} \log m)$ random bits and $O(K \log \frac{K}{\delta})$ time to extract the sample $S$.

FRS is inspired by Count Sketch [TJ. It is composed of $\tau = O(\log \frac{K}{\delta})$ arrays of size
$s = O(K)$. We refer to the cells in the array as bins. Each input element is mapped to one
bin in each of the $\tau$ arrays. In each bin there is an instance of $BS_s$. We use $\tau$ hash functions
drawn randomly and independently from a pairwise independent family $\mathcal{H} = \{h : [m] \to [s]\}$.
The same set of hash functions can be used for the instances of FRS in all levels. The
mapping is performed as follows.

Let $B[a, b]$ for $a \in [\tau]$ and $b \in [s]$ be the $b$'th bin in the $a$'th array. Let the hash functions
be $h_1, \ldots, h_\tau$. Then $(x_i, c_i)$ is mapped to $B[a, h_a(x_i)]$ for every $a \in [\tau]$. We say that two
elements collide if they are mapped to the same bin.

► Lemma 7. Let FRS with at most $\tilde{K}$ elements have $\tau = \log \frac{\tilde{K}}{\delta}$ arrays of size $s = 2\tilde{K}$. Then
with probability at least $1 - \delta$ for each element there is a bin in which it does not collide with
any other element.

Proof. See Appendix B. ◀

► Corollary 8. In the Strict Turnstile model all elements in FRS can be identified and
recovered with probability at least $1 - \delta$.

For recovery, we scan all bins in all arrays in FRS and use $BS_s$ to extract elements from
all the bins that contain a single element. According to Lemma 7, all elements in FRS can
be identified with success probability $1 - \delta$. We verify success by removing all the elements
we found and scanning the arrays an additional time to validate that they are all empty.
Removing an element $(x, c)$ is performed by inserting $(x, -c)$ to the corresponding bins.
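
The Strict Turnstile recovery procedure can be sketched end to end as follows. This is an illustration under assumptions made here: the number of arrays is taken as $\lceil \log_2(\tilde{K}/\delta) \rceil$ with arrays of size $2\tilde{K}$, the pairwise independent hashes are of the form $((a_1 x + a_0) \bmod p) \bmod s$ for an assumed prime $p$ (only approximately uniform after the final reduction), element values are positive integers, and each bin stores the three strict Bin Sketch counters as a list.

```python
import math
import random

PRIME = (1 << 61) - 1  # assumed field size for the pairwise independent hashes

class StrictFRS:
    def __init__(self, max_elements, delta, rng):
        self.s = 2 * max_elements
        self.tau = max(1, math.ceil(math.log2(max_elements / delta)))
        self.coeffs = [(rng.randrange(PRIME), rng.randrange(PRIME)) for _ in range(self.tau)]
        # bins[a][b] holds the counters [X, Y, Z] of a strict Bin Sketch
        self.bins = [[[0, 0, 0] for _ in range(self.s)] for _ in range(self.tau)]

    def _bin(self, a, value):
        a1, a0 = self.coeffs[a]
        return self.bins[a][((a1 * value + a0) % PRIME) % self.s]

    def update(self, value, count):
        for a in range(self.tau):
            b = self._bin(a, value)
            b[0] += count
            b[1] += count * value
            b[2] += count * value * value

    def recover(self):
        """Return {value: count} if every element was recovered, else None.

        Recovered elements are removed from the structure as part of verification.
        """
        found = {}
        for array in self.bins:
            for X, Y, Z in array:
                if X != 0 and Y != 0 and Z != 0 and X * Z == Y * Y:
                    found[Y // X] = X
        for value, count in found.items():
            self.update(value, -count)  # removal is insertion of (x, -c)
        all_empty = all(b == [0, 0, 0] for array in self.bins for b in array)
        return found if all_empty else None

frs = StrictFRS(max_elements=20, delta=1e-6, rng=random.Random(1))
expected = {}
for v in range(1, 21):
    frs.update(1000 * v + 7, v)
    expected[1000 * v + 7] = v
frs.update(5, 3)
frs.update(5, -3)  # an element that is inserted and then deleted
sample = frs.recover()
```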

4.4.1 Non-strict FRS 

We now present the generalization of FRS to Non-strict streams. Once again we use $BS_s$,
but we add to our sample only elements that are consistently identified in multiple arrays.

► Lemma 9. Let FRS with at most $\tilde{K}$ elements have $\tau = 5 \log \frac{\tilde{K}}{\delta}$ arrays of size $s = 8\tilde{K}$. In
the Non-strict data stream model, all elements inserted to FRS and only those elements are
added to the sample with probability at least $1 - \delta$.

Proof. We extract a set $A$ of candidate elements from all $BS_s$'s that seem to have a single
element in the first $\log \frac{\tilde{K}}{\delta}$ arrays in FRS. $A$ contains existing elements, that were inserted to
FRS, and falsely detected elements that are a result of a collision. $|A| \le \tilde{K} \log \frac{\tilde{K}}{\delta}$. It follows
from Lemma 7 and Corollary 8 that all of the existing elements can be recovered from the
first $\log \frac{\tilde{K}}{\delta}$ arrays with probability $1 - \delta/2$ (increasing the array size reduces the probability
of a collision). Hence $A$ contains all existing elements with probability $1 - \delta/2$.
Next we insert to our output sample all candidates that we detect in at least half of their
bins in the $\tau' = 4 \log \frac{\tilde{K}}{\delta}$ remaining arrays of FRS. There are two types of possible errors:
not reporting an existing element (false negative) and reporting a falsely detected element
(false positive).

False negative: We show that with high probability existing elements are isolated in
at least half of their bins in the $\tau'$ arrays. Let $C_k^a$ be the event that element $k$ collides
with another element in array $a$. $\Pr[C_k^a] \le \tilde{K} \cdot \frac{1}{s} = \frac{1}{8}$. Let $C_k$ be the event that element $k$
collides with another element in at least half of the $\tau'$ arrays. The hash functions of the
different arrays are independent and therefore: $\Pr[C_k] \le \binom{\tau'}{\tau'/2} \Pr[C_k^a]^{\tau'/2} \le (\frac{\delta}{\tilde{K}})^2$, where
$\binom{\tau'}{\tau'/2} \le 2^{\tau'}$ is used. Let $C$ be the event that there is an existing element that collides with
another element in at least half of the $\tau'$ arrays. $\Pr[C] \le \tilde{K} \cdot \Pr[C_k] \le \tilde{K} (\frac{\delta}{\tilde{K}})^2 \le \frac{\delta}{4}$.

False positive: If a falsely detected element from $A$ is added to the sample then there
is a collision in at least half of its bins in the $\tau'$ arrays. Let $E_b^a$ be the event that there is
an element in bin $b$ of array $a$. $\Pr[E_b^a] \le \tilde{K} \cdot \frac{1}{s} = \frac{1}{8}$. Let $E_k$ be the event that there are
elements in the bins corresponding to a falsely detected element $k$ in at least half of the
$\tau'$ arrays. $\Pr[E_k] \le \binom{\tau'}{\tau'/2} \Pr[E_b^a]^{\tau'/2} \le 2^{\tau'} (2^{-3})^{\tau'/2} = 2^{-0.5\tau'} = (\frac{\delta}{\tilde{K}})^2$. Let $E$ be the event
that there is an element from $A$ that was falsely identified in at least half of its bins. Using
the union bound we get: $\Pr[E] \le |A| \cdot \Pr[E_k] \le \tilde{K} \log \frac{\tilde{K}}{\delta} \cdot (\frac{\delta}{\tilde{K}})^2 \le \frac{\delta}{4}$.

We conclude that the probability of a mistake is bounded by: $\delta/2 + \Pr[C] + \Pr[E] \le \delta$. ◀

4.5 $\epsilon$-Full Recovery Data Structure ($\epsilon$-FRS)

In this section we present the $\epsilon$-Full Recovery Data Structure ($\epsilon$-FRS) that allows recovering
almost all elements inserted to it. We refer to the entire algorithm with $\epsilon$-FRS placed in
each of the levels as the $\epsilon$-FRS algorithm. The sampling algorithm is summarized in the
following theorem.

► Theorem 10. Given a required sample size $K = \Omega(\frac{1}{\epsilon} \log \frac{1}{\delta})$, for $\delta \in (0, 1)$ and $\epsilon \in (0, 1)$,
the $\epsilon$-FRS sampling algorithm generates a $(t, \epsilon)$-partial sample $S$ for $t = \Theta(\log \frac{1}{\delta})$ with $1 - \delta$
success probability. The sample size is $(1 - \epsilon)K \le |S| \le \tilde{K}$, where $\tilde{K} = \Theta(K)$. For
both Strict and Non-strict data streams $\epsilon$-FRS requires $O((\log\log \frac{K}{\delta})^2)$ update time per
element, $O(\log \frac{K}{\delta} \log m)$ random bits and $O(K)$ time to extract the sample $S$. The space is
$O(K \log(mr) \log(m))$ bits for Strict data streams and $O(K \log(\frac{mr}{\delta}) \log(m))$ bits for
Non-strict streams.

$\epsilon$-FRS is composed of $\tau = 2$ arrays of size $s = O(K)$. As in FRS, each input element
is mapped to one bin in each of the arrays. In each bin of each array we keep an instance
of $BS_s$ or $BS_{ns}$ according to the input data stream. The mapping is performed using
two hash functions drawn randomly and independently from a $t$-wise independent family
$\mathcal{H} = \{h : [m] \to [s]\}$ for $t = \Theta(\log \frac{K}{\delta})$.

Let $X$ be the set of elements in $\epsilon$-FRS, $|X| \le \tilde{K}$. A fail set $F \subseteq X$ is a set of $f$ elements,
such that each element in the set collides with other elements from the set in both its bins.
The elements in a fail set $F$ cannot be extracted from $\epsilon$-FRS. Analyzing the existence of
a fail set is similar to analyzing failure in a cuckoo hashing [55] insertion. We bound the
probability that there is a fail set of size $f$ using the following (revised) lemma of Pagh and
Pagh [21].

► Lemma 11 ([21], Lemma 3.4). For two functions $i_1, i_2 : U \to [R]$ and a set $S \subseteq U$, let
$G(i_1, i_2, S) = (A, B, E)$ be the bipartite graph that has left vertex set $A = \{a_1, \ldots, a_R\}$, right
vertex set $B = \{b_1, \ldots, b_R\}$, and edge set $E = \{e_x \mid x \in S\}$, where $e_x = (a_{i_1(x)}, b_{i_2(x)})$.

For each set $S$ of size $n$, and for $i_1, i_2 : U \to [4n]$ chosen at random from a family that is
$t$-wise independent on $S$, $t \ge 32$, the probability that the fail set $F$ of the graph $G(i_1, i_2, S)$
has size at least $t$ is $n / 2^{\Omega(t)}$.

► Corollary 12. Let $\epsilon$-FRS with at most $\tilde{K}$ elements have 2 arrays of size $s = 4\tilde{K}$. Let the
mapping be performed by two $t$-wise independent hash functions for $t = c \log \frac{\tilde{K}}{\delta}$, for a constant $c$
and $t \ge 32$. The probability that there is a fail set of size at least $t$ is bounded by $\delta$.

Proof. The more elements in $\epsilon$-FRS, the higher the probability that there is a fail set of
some fixed predefined size. The probability is $\tilde{K} / 2^{c' \cdot c \log \frac{\tilde{K}}{\delta}} \le \delta$ for some constants $c, c'$. ◀

The following algorithm identifies all elements in $\epsilon$-FRS that do not belong to a fail set.

1. Initialize the output sample $S = \emptyset$ and a queue $Q = \emptyset$.

2. Scan the two arrays in $\epsilon$-FRS. For each bin $b$, if BS holds a single element, Enqueue($Q$, $b$).

3. While $Q \neq \emptyset$:

3.1. $b \leftarrow$ Dequeue($Q$). If BS in $b$ holds a single element:

3.1.1. Extract the element $(k, C_k)$ from BS in $b$.

3.1.2. $S = S \cup \{(k, C_k)\}$.

3.1.3. Subtract $(k, C_k)$ from BS in $b'$, where $b'$ is the other bin $k$ is hashed to.

3.1.4. Enqueue($Q$, $b'$).

4. Return $S$.
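
The queue-based recovery above can be sketched for the Strict Turnstile case as follows. Several choices here are assumptions for illustration: the hash functions are random polynomials over an assumed prime field reduced modulo the array size, element values are positive integers, bins hold strict Bin Sketch counters, and, as a small deviation from step 3.1.3, an extracted element is subtracted from both of its bins, which makes termination immediate.

```python
import random
from collections import deque

PRIME = (1 << 61) - 1  # assumed field size for the polynomial hashes

class EpsFRS:
    def __init__(self, max_elements, t, rng):
        self.s = 4 * max_elements
        self.polys = [[rng.randrange(PRIME) for _ in range(t)] for _ in range(2)]
        self.bins = [[[0, 0, 0] for _ in range(self.s)] for _ in range(2)]

    def _pos(self, a, value):
        acc = 0
        for coeff in reversed(self.polys[a]):  # Horner's rule
            acc = (acc * value + coeff) % PRIME
        return acc % self.s

    def update(self, value, count):
        for a in range(2):
            b = self.bins[a][self._pos(a, value)]
            b[0] += count
            b[1] += count * value
            b[2] += count * value * value

    @staticmethod
    def _single(b):
        X, Y, Z = b
        if X != 0 and Y != 0 and Z != 0 and X * Z == Y * Y:
            return Y // X, X
        return None

    def recover(self):
        sample = {}
        queue = deque((a, i) for a in range(2) for i in range(self.s)
                      if self._single(self.bins[a][i]) is not None)
        while queue:
            a, i = queue.popleft()
            element = self._single(self.bins[a][i])
            if element is None:
                continue  # the bin was emptied or is still a collision
            k, count = element
            sample[k] = count
            self.update(k, -count)                      # remove k from both of its bins
            queue.append((1 - a, self._pos(1 - a, k)))  # its other bin may now be single
        return sample

frs = EpsFRS(max_elements=32, t=16, rng=random.Random(3))
expected = {1000 + 37 * v: v for v in range(1, 31)}
for value, count in expected.items():
    frs.update(value, count)
sample = frs.recover()
```

Elements that belong to a fail set are simply absent from the returned sample, which is why the result is compared against the inserted elements as a subset rather than for equality.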

► Lemma 13. All elements that do not belong to a fail set are identified by the algorithm.

Proof. See Appendix B. ◀

► Lemma 14. The recovery algorithm takes $O(\tilde{K})$ time.

Proof. If the algorithm is implemented with hash computations for finding the other bin an
element is hashed to, it takes $O(\tilde{K} (\log\log \frac{1}{\delta})^2)$ time. Using an additional counter in each
BS reduces the time to $O(\tilde{K})$. See Appendix B for the complete proof. ◀

Let $X$ be the elements in $\epsilon$-FRS, $K \le |X| \le \tilde{K}$, $\tilde{K} = \Theta(K)$. In order to recover all but
$\epsilon |X|$ of the elements we require $K = \Omega(\frac{1}{\epsilon} \log \frac{1}{\delta})$. If $K$ is smaller, we recover all but at most
$O(\max\{\epsilon |X|, f\})$ of the elements, where $f = O(\log \frac{1}{\delta})$ is the size of the fail set.

4.5.1 Non-strict ε-FRS

In the Non-strict Turnstile model we keep $BS_{ns}$ in each bin, and we set the range of the hash function to $q = \Theta(\frac{K}{\delta})$ and its independence to $t' = \Theta(\log \frac{K}{\delta})$, the same as the independence of the hash functions we use when mapping to the bins in ε-FRS. The same hash function can be used for all $BS_{ns}$s. Recall that if $BS_{ns}$ contains a single element, this element is extracted successfully. If $BS_{ns}$ contains fewer than $t'$ elements, an event called a small collision, the probability of an error is at most $1/q$. If $BS_{ns}$ contains $t'$ elements or more, an event called a large collision, we have no guarantee on the probability of an error.

► Lemma 15. Let ε-FRS with at most $K$ elements have 2 arrays of size $s = 4K$. Let the mapping be performed by two $t$-wise independent hash functions for $t = 2 \log \frac{K}{\delta}$. Let each bin contain $BS_{ns}$ with $q = \frac{4K}{\delta}$ and $t' = t$. The probability of no false detections during the entire recovery process is at least $1 - \delta$.

Proof. First we bound the probability of a large collision in ε-FRS. Let ε-FRS have $|X| \le K$ elements. Let $E_b$ be the event that there is a large collision in bin $b$. Since $t = t'$, every $t'$ elements that appear in a large collision are mapped there independently. Thus,

$$\Pr[E_b] \le \binom{K}{t'} \left(\frac{1}{s}\right)^{t'} \le \left(\frac{eK}{t' s}\right)^{t'} \le \left(\frac{1}{2}\right)^{t'} \le \left(\frac{\delta}{K}\right)^{2}.$$

Using the union bound, the probability that there is a large collision in any bin is at most $\delta/2$. Hence we can consider only small collisions.

The total number of inspections of bins with collisions during the recovery process is at most $2K$. Therefore the probability of detecting at least one false element as a result of a small collision is at most $2K \cdot \frac{1}{q} = \delta/2$, and the probability of any false detections is at most $\delta$. ◄

► Corollary 16. If $BS_{ns}$ with $q = \frac{4K}{\delta}$ and $t' = \Theta(\log \frac{K}{\delta})$ is placed in each bin, the recovery procedure of the Strict Turnstile model can be used also for Non-strict Turnstile data streams with $1 - \delta$ success probability.

5 Applications and Extensions 
Inverse Distribution 

The samples generated by the algorithms FRS and ε-FRS can be used to derive an additive ε-approximation with $1 - \delta$ success probability for various forward and inverse distribution queries. For example, consider Inverse point queries, which return the value of $f^{-1}(i)$ for a query frequency $i$. The samples from FRS and ε-FRS can be used to obtain an approximation in $[f^{-1}(i) - \epsilon, f^{-1}(i) + \epsilon]$ for every frequency $i$. We can approximate Inverse range queries, Inverse heavy hitters and Inverse quantiles queries in a similar way.

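► Lemma 17. Let $S$ be a $(t, \epsilon')$-partial sample of size $|S| = \Omega(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$ for $\epsilon \in (0, 1)$, $\epsilon' = \Theta(\epsilon)$, $\delta \in (0, 1)$, and $t = \Omega(\log \frac{1}{\delta})$. The estimator $f^{-1}(i) \approx \frac{|\{k \in S : C_k = i\}|}{|S|}$ provides an additive ε-approximation to the inverse distribution with probability at least $1 - \delta$.

Proof. See Appendix B. ◄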
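The estimator of Lemma 17 is simple to state in code. The sketch below assumes the sample is available as a Python dict mapping each sampled value $k$ to its exact total count $C_k$; the function names are hypothetical, and the range query is included only as one illustration of the "similar way" mentioned above.

```python
def inverse_point_estimate(sample, i):
    """Estimate f^{-1}(i): the fraction of sampled values whose total count is i."""
    if not sample:
        raise ValueError("the sample is empty")
    return sum(1 for c in sample.values() if c == i) / len(sample)


def inverse_range_estimate(sample, lo, hi):
    """Estimate the fraction of values whose total count lies in [lo, hi]."""
    if not sample:
        raise ValueError("the sample is empty")
    return sum(1 for c in sample.values() if lo <= c <= hi) / len(sample)
```

For the sample `{"a": 1, "b": 1, "c": 2, "d": 5}`, half of the sampled values have total count 1, so the point estimate for $i = 1$ is 0.5.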
Union and Difference 

Let $DS_{r,D_i}$ be the data structure obtained from data stream $D_i$ using the random bits $r$. The union of streams $D_1, D_2$ is $DS_{r, D_1 \cup D_2} = DS_{r,D_1} + DS_{r,D_2}$, where the addition operator adds all BSs in all bins of all arrays. Our sampling algorithm can extract a sample from a union of data streams. This feature is useful when there are multiple entry points: each of them can update its own data structure locally, and a unified sample can then be derived.

Sampling from the difference of streams $DS_{r, D_1 - D_2} = DS_{r,D_1} - DS_{r,D_2}$ is similar. Note that even if $D_1$ and $D_2$ are Strict Turnstile data streams, their difference might represent a Non-strict Turnstile stream. Hence our structures for the Non-strict Turnstile model are useful both for input streams in the Non-strict model and for sampling the difference.
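The union and difference operations rely only on the data structure being linear in the stream. The following sketch illustrates this with a simplified stand-in for the structure: each bin is a list of three running sums, which is an assumption made for the example and not the paper's BS. Both streams must be sketched with the same hash functions, playing the role of the shared random bits $r$.

```python
def make_sketch(stream, hashes, s):
    """Build one array of s bins per hash function; each bin is [count, key_sum, sq_sum]."""
    arrays = [[[0, 0, 0] for _ in range(s)] for _ in hashes]
    for key, c in stream:
        for a, h in enumerate(hashes):
            cell = arrays[a][h(key)]
            cell[0] += c
            cell[1] += key * c
            cell[2] += key * key * c
    return arrays


def combine(ds1, ds2, sign=1):
    """Bin-wise sum (sign=1, union) or difference (sign=-1) of two sketches."""
    return [[[x + sign * y for x, y in zip(cell1, cell2)]
             for cell1, cell2 in zip(array1, array2)]
            for array1, array2 in zip(ds1, ds2)]
```

Combining the sketches of $D_1$ and $D_2$ gives exactly the sketch of the concatenated stream. With `sign=-1` some bins can end up with a negative count even when both inputs are Strict Turnstile streams, which is why the difference calls for the Non-strict structures.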
A Approximation Using High Moments

For a random variable $Z$ and an even number $l$,

$$\Pr[|Z - E[Z]| \ge t] = \Pr[(Z - E[Z])^l \ge t^l] \le \frac{A^l}{t^l},$$

where $A^l = E[(Z - E[Z])^l]$ is the $l$'th central moment of $Z$. We use the estimate of [15] for $A^l$, where $Z$ is a sum of $l$-wise independent indicator variables: $A^l \le 8 (6l)^{(l+1)/2} E[Z]^{(l+1)/2}$. This implies:

$$\Pr[|Z - E[Z]| \ge \alpha E[Z]] \le \frac{48\, l}{\alpha} \left( \frac{6l}{\alpha^2 E[Z]} \right)^{(l-1)/2}.$$

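► Lemma 18. Let $Z$ be a sum of $l$-wise independent indicator variables with $E[Z] = \frac{c}{\epsilon^2} \log \frac{1}{\delta}$ for some $\epsilon, \delta \in (0, 1)$ and constant $c$. Let $l = \tilde{c} \log \frac{1}{\delta}$ for some constant $\tilde{c}$ be an even number. Then for big enough constants $c, \tilde{c}$: $\Pr[|Z - E[Z]| \ge \epsilon E[Z]] \le \delta$.

Proof.

$$\Pr[|Z - E[Z]| \ge \epsilon E[Z]] \le \frac{48\, \tilde{c} \log \frac{1}{\delta}}{\epsilon} \left( \frac{6 \tilde{c} \log \frac{1}{\delta}}{c \log \frac{1}{\delta}} \right)^{(\tilde{c} \log \frac{1}{\delta} - 1)/2} = \frac{48\, \tilde{c} \log \frac{1}{\delta}}{\epsilon} \left( \frac{6 \tilde{c}}{c} \right)^{(\tilde{c} \log \frac{1}{\delta} - 1)/2} \le \delta$$

for $\epsilon \ge \mathrm{poly}(\delta)$. ◄

Note that for a constant $\alpha$, in order to prove $\Pr[|Z - E[Z]| \ge \alpha E[Z]] \le \delta$, $E[Z] = \Omega(\log \frac{1}{\delta})$ is sufficient.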
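For readers who want the intermediate algebra, the tail bound of this appendix follows from the moment estimate by setting the threshold to $\alpha E[Z]$ in the moment inequality. The derivation below uses only the symbols already defined here ($l$ even, $A^l$ the $l$'th central moment); the grouping of terms is the only thing added.

```latex
\begin{align*}
\Pr\big[\,|Z - E[Z]| \ge \alpha E[Z]\,\big]
  &\le \frac{A^l}{(\alpha E[Z])^l}
   \le \frac{8\,(6l)^{(l+1)/2}\,E[Z]^{(l+1)/2}}{\alpha^l\,E[Z]^l} \\
  &= \frac{8 \cdot 6l}{\alpha} \cdot
     \frac{(6l)^{(l-1)/2}}{\alpha^{l-1}\,E[Z]^{(l-1)/2}}
   = \frac{48\,l}{\alpha}
     \left(\frac{6l}{\alpha^2\,E[Z]}\right)^{(l-1)/2}.
\end{align*}
```

When $E[Z]$ is at least a sufficiently large constant multiple of $l / \alpha^2$, the base of the power is a constant smaller than 1, so the bound decays exponentially in $l$.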

B Proofs from Section 4
B.1 Proof of Lemma 5

Proof. Let $X_l$ be a random variable that indicates the number of elements $k$ in the stream for which $(h_l(k) = 0 \wedge h_{l+1}(k) \ne 0)$. I.e. $X_l$ is the number of elements in level $l$. $H$ is a $t$-wise independent hash family for $t = \Theta(\log \frac{1}{\delta})$. Therefore for each $l \in L$ and $k$ in the stream, $\Pr_{h \in H}[h_l(k) = 0 \wedge h_{l+1}(k) \ne 0] = \lambda^l - \lambda^{l+1} = \lambda^l (1 - \lambda)$. We denote $p_l = \lambda^l (1 - \lambda)$.

Lo < Lq < aLo implies \Lq < Lq < Lo and we obtain: ^I/ope+i < 2K < ^Lqpi* < 
Lopi* < L0P1* < ^x^- The expected number of elements in level / is -EpQ] = LqPi- Hence 
for level I* , 2K < E[X t *} < ^L. 



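We write $E[X_{l^*}] = \beta K$, where $\beta \ge 2$ is bounded above as in the preceding inequality. From Lemma 18 we get: $\Pr[|X_{l^*} - E[X_{l^*}]| > K] = \Pr[|X_{l^*} - E[X_{l^*}]| > \frac{1}{\beta} E[X_{l^*}]] \le \delta$. We can use the lemma since $X_{l^*}$ is a sum of $t$-wise independent variables, $t = \Theta(\log \frac{1}{\delta})$, and $K = \Omega(\log \frac{1}{\delta})$.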

B.2 Proof of Lemma 7

Proof. First we bound the probability that a specific element $k$ collides with another element in a specific array $a$. Let $C^a_{k,j}$ be the event that elements $k$ and $j$ collide in array $a$. Since pairwise independent hash functions are used to map the elements to bins, $\forall k, j, a$: $\Pr[C^a_{k,j}] = \frac{1}{s} = \frac{1}{2K}$. Let $C^a_k$ be the event that $k$ collides with any element in array $a$. $\Pr[C^a_k] = \Pr[\exists j\; C^a_{k,j}] \le \sum_{j \ne k} \Pr[C^a_{k,j}] \le K \cdot \Pr[C^a_{k,j}] = \frac{1}{2}$.

Now we prove that with probability $1 - \delta/K$ there is an array in which no element collides with $k$. Let $|C_k|$ be the number of arrays in which $k$ collides with another element. We want to show $|C_k| < \tau$ with high probability. Let $X$ be the set of elements in FRS. We know $|X| \le K$. The hash functions of the different arrays are independent. Therefore $\Pr[|C_k| = \tau] = \Pr[\forall a \in [\tau],\, C^a_k] \le \Pr[\forall a \in [\tau],\, C^a_k \mid |X| = K] = \prod_{a \in [\tau]} \Pr[C^a_k \mid |X| = K] \le (\frac{1}{2})^{\tau} = \frac{\delta}{K}$.

We conclude with: $\Pr[\exists k,\, k \text{ collides in all arrays}] = \Pr[\exists k,\, |C_k| = \tau] \le \sum_k \Pr[|C_k| = \tau] \le \delta$.
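As a worked instance of the arithmetic in this proof, the sketch below computes the smallest number of arrays for which the per-element bound $(1/2)^{\tau}$ drops to $\delta / K$, and evaluates the resulting union bound over $K$ elements. It assumes, as in the proof, that an element collides in a single array with probability at most $1/2$; the function names are hypothetical.

```python
import math


def arrays_needed(K, delta):
    """Smallest tau such that (1/2)**tau <= delta / K."""
    return math.ceil(math.log2(K / delta))


def all_arrays_collision_bound(K, tau):
    """Union bound over K elements on some element colliding in all tau arrays."""
    return K * 0.5 ** tau
```

For example, with $K = 100$ and $\delta = 0.05$ the function returns $\tau = 11$, and the union bound evaluates to $100 / 2048 \approx 0.049 \le \delta$.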

B.3 Proof of Lemma 13

Proof. If the set of unidentified elements is not a fail set, then one of them is isolated in a 
bin. Let x be the element and b be the bin in which x is the single element. If x was isolated 
prior to the first scan over all bins, bin b would have been identified in the first scan. If 
x became isolated during the recovery process, then all other elements in b were identified 
beforehand. When the last of them was identified and removed from b, x became isolated, 
and b was inserted into the queue. Each bin inserted into the queue is removed from the queue
before the process ends, and hence when b was removed, x was identified. Identifying x
results in a smaller set of unidentified elements. The identification process can proceed until
all elements are identified or none of them are isolated, i.e. they all belong to a fail set. ◄

B.4 Proof of Lemma 14

Proof. The number of operations in the initial scan is linear in the number of bins in ε-FRS, which is $O(K)$. The number of bins inserted into $Q$ in the process is $O(K)$, because a bin is inserted only when an element is extracted and added to the sample. Thus, apart from the operation of identifying the other bin an element is hashed to, all other operations take a total of $O(K)$ time.

Assume the phase of finding the other bin an extracted element $k$ is hashed to is implemented in the algorithm by evaluating the hash function on $k$. This evaluation occurs $O(K)$ times. The hash functions are $t$-wise independent, where $t = \Theta(\log \frac{K}{\delta})$. Thus, in this implementation the recovery time for all $O(K)$ elements is $O(K (\log\log \frac{K}{\delta})^2)$.

The recovery time can be reduced to $O(K)$ by using an additional counter $W$ in each BS. Let $h_1, h_2$ be the two hash functions that map the elements to the bins in the two arrays. When inserting an element $(x_i, c_i)$ into BS in array $a \in \{1, 2\}$, the following update is performed: $W \leftarrow W + c_i h_{3-a}(x_i)$. The space and update time required by the algorithm remain of the same order, since the update takes an additional $O(1)$ time and the space of $W$ is $O(\log(mr))$ bits.

If BS in array $a$ contains a single element $(k, C_k)$, then $W = C_k h_{3-a}(k)$. Thus, if $(k, C_k)$ is extracted from BS in array $a$ we obtain its location $h_{3-a}(k)$ in the other array $3 - a$ without evaluating a hash function. The recovery time is $O(K)$ for all $O(K)$ elements. ◄
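The counter $W$ can be illustrated with a small sketch. The class `BinSketchW`, the helpers `insert` and `peel`, and the reduced set of counters are hypothetical and chosen only to show the mechanism; arrays are indexed 0 and 1 here, so the "other" array of $a$ is $1 - a$. One detail is an assumption of this sketch rather than something stated in the proof: when an element is subtracted from its other bin, the $W$ of that bin is updated using the index of the bin being processed, which equals $h_a(k)$, so no hash evaluation is needed there either.

```python
class BinSketchW:
    """Hypothetical bin sketch carrying the extra counter W."""

    def __init__(self):
        self.count = 0    # sum of c_i
        self.key_sum = 0  # sum of c_i * x_i
        self.w = 0        # W: sum of c_i * (bin of x_i in the other array)

    def update(self, key, c, other_bin):
        self.count += c
        self.key_sum += c * key
        self.w += c * other_bin

    def other_bin_of_single(self):
        """Valid only when the bin holds a single element (k, C_k): W / C_k."""
        return self.w // self.count


def insert(arrays, hashes, key, c):
    """Stream update: both hash functions are evaluated once per update."""
    bins = [hashes[0](key), hashes[1](key)]
    for a in (0, 1):
        arrays[a][bins[a]].update(key, c, bins[1 - a])


def peel(arrays, a, b):
    """Extract the single element of bin b in array a without hashing.

    The caller must already know that the bin holds exactly one element.
    """
    sketch = arrays[a][b]
    key, c = sketch.key_sum // sketch.count, sketch.count
    b_other = sketch.other_bin_of_single()      # location in the other array, read from W
    arrays[1 - a][b_other].update(key, -c, b)   # b is the bin of k in array a
    sketch.update(key, -c, b_other)
    return key, c, b_other
```

During recovery `peel` replaces the hash evaluation in step 3.1.3, which is the source of the improvement from $O(K (\log\log \frac{K}{\delta})^2)$ to $O(K)$.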

B.5 Proof of Lemma 17

Proof. Our estimator is $f^{-1}(i) \approx \frac{|\{k \in S : C_k = i\}|}{|S|}$. We need to prove that it is an ε-approximation to $f^{-1}(i) = \frac{|\{k : C_k = i\}|}{|\{k : C_k \ne 0\}|}$, the fraction of distinct values with total count equal to $i$, with probability $1 - \delta$.

We prove that it is an ε-approximation when all elements are recovered from the recovery structure, i.e. when FRS is used. Later we relax the assumption that all elements are recovered, and thus prove that we get an ε-approximation also when using ε-FRS. Thus, an ε-approximation is provided with $1 - \delta$ success probability when using both our sampling algorithms.

The elements recovered are from a specific level $l^*$ that depends on $\tilde{L}_0$. For value $k$, $C_k \ne 0$, we define a random variable $Y_k$. $Y_k = 1$ if $(k, C_k)$ is mapped to level $l^*$. $\Pr[Y_k = 1] = \lambda^{l^*} (1 - \lambda)$, and we denote $p_{l^*} = \lambda^{l^*} (1 - \lambda)$.

Let $Y = \sum_k Y_k$ be the number of elements in level $l^*$. For now assume that all elements in the level are recovered, i.e. $Y = |S|$. Later we relax this assumption. $E[Y]$ is the expected sample size obtained from level $l^*$. Thus $E[|S|] = E[Y] = L_0 p_{l^*}$. We choose $l^*$ as a level with $\Theta(K)$ elements, i.e. $E[Y] = \Theta(K) = \Omega(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$.



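$\{Y_k\}$ are $t$-wise independent, where $t = \Theta(\log \frac{1}{\delta})$, and $E[Y] = \Omega(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$. From Lemma 18 we obtain $\Pr[|Y - E[Y]| \ge \epsilon' E[Y]] \le \delta'$, for $\epsilon' = \Theta(\epsilon)$ and $\delta' = \Theta(\delta)$ that will be determined later. Thus, with probability $1 - \delta'$,

$$(1 - \epsilon') E[|S|] \le |S| \le (1 + \epsilon') E[|S|].$$

Let $F_i = \sum_{k : C_k = i} Y_k$ be the number of elements in $S$ with frequency $i$. $E[F_i] = |\{k : C_k = i\}| \cdot p_{l^*} = f^{-1}(i) \cdot E[|S|]$. We get $f^{-1}(i) = \frac{E[F_i]}{E[|S|]}$. If all elements in the level are recovered, the estimator of $f^{-1}(i)$ can be written as: $f^{-1}(i) \approx \frac{F_i}{|S|}$.

Hence we would like to prove:

$$\Pr\left[ \left| f^{-1}(i) - \frac{F_i}{|S|} \right| > \epsilon \right] \le \delta \qquad (1)$$

$\frac{E[F_i]}{|S|}$ is an $\epsilon'$-approximation to $\frac{E[F_i]}{E[|S|]}$, i.e. $\Pr\left[ \left| f^{-1}(i) - \frac{E[F_i]}{|S|} \right| > \epsilon' \right] \le \delta'$, since

$$f^{-1}(i) - \epsilon' \le \frac{E[F_i]}{(1 + \epsilon') E[|S|]} \le \frac{E[F_i]}{|S|} \le \frac{E[F_i]}{(1 - \epsilon') E[|S|]} \le \frac{E[F_i]}{E[|S|]} + \epsilon'$$

with probability $1 - \delta'$ when $f^{-1}(i) = \frac{E[F_i]}{E[|S|]} \le 1 - \epsilon'$. Then by using Lemma 18 again we get, for a constant $c$, a $2\epsilon'$-additive error with $1 - 2\delta'$ success probability.

Now we relax the assumption that all elements in the level were recovered. The maximal bias occurs if $\epsilon' |S|$ elements were not recovered, and all unrecovered elements had frequency $i$. I.e. $|\{k \in S : C_k = i\}| = F_i \pm \epsilon' |S|$. Thus, there is an additional additive error of $\epsilon'$. Setting $\epsilon' = \epsilon/3$, $\delta' = \delta/3$ completes the proof. ◄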



